Machine Learning 


Basic Concepts


Terminology
Machine Learning, Data Science, Data Mining, Data Analysis, Statistical Learning, Knowledge Discovery in Databases, Pattern Discovery.


Data everywhere!
Google: processes 24 petabytes of data per day.
Facebook: 10 million photos uploaded every hour. 
YouTube: 1 hour of video uploaded every second.
Twitter: 400 million tweets per day. 


Astronomy: Satellite data is in hundreds of PB. 


“By 2020 the digital universe will reach 44 
zettabytes...” 


The Digital Universe of Opportunities: Rich Data and the 
Increasing Value of the Internet of Things, April 2014. 
That's 44 trillion gigabytes!


Data types 


Data comes in different sizes and also flavors (types): 
• Texts
• Numbers
• Clickstreams
• Graphs
• Tables
• Images
• Transactions
• Videos
• Some or all of the above!


Smile, we are 'DATAFIED'! 


• Wherever we go, we are "datafied".
• Smartphones are tracking our locations.
• We leave a data trail in our web browsing.
• Interaction in social networks.
• Privacy is an important issue in Data Science.


The Data Science 


[Figure: data science pipeline. Stages recoverable from the diagram:
 - Data collection
 - Data preparation: data cleaning, feature/variable engineering
 - Descriptive statistics, clustering, research questions
 - Classification, scoring, predictive models, clustering, density estimation, etc.
 - Visualization
 - Application deployment]


Applications of ML


• We all use it on a daily basis. Examples:

• Spam filtering
• Credit card fraud detection
• Digit recognition on checks, zip codes
• Detecting faces in images
• MRI image analysis
• Recommendation systems
• Search engines
• Handwriting recognition
• Scene classification
• etc.


Interdisciplinary field 


ML versus Statistics


Statistics:               | Machine Learning:
• Hypothesis testing      | • Decision trees
• Experimental design     | • Rule induction
• ANOVA                   | • Neural networks
• Linear regression       | • SVMs
• Logistic regression     | • Clustering methods
• GLM                     | • Association rules
• PCA                     | • Feature selection
                          | • Visualization
                          | • Graphical models
                          | • Genetic algorithms


http://statweb.stanford.edu/~jhf/ftp/dm-stat.pdf


Machine Learning definition 


"How do we create computer programs that improve with experience?"
Tom Mitchell
http://videolectures.net/mlas06_mitchell_itm/


"A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with
experience E."
Tom Mitchell. Machine Learning 1997.


Supervised vs. Unsupervised


Given: Training data: $(x_1, y_1), \ldots, (x_n, y_n)$ / $x_i \in \mathbb{R}^d$ and $y_i$ is the label.

example $x_1$ → $x_{11}$ $x_{12}$ ... $x_{1d}$ | $y_1$ ← label
...
example $x_i$ → $x_{i1}$ $x_{i2}$ ... $x_{id}$ | $y_i$ ← label
...
example $x_n$ → $x_{n1}$ $x_{n2}$ ... $x_{nd}$ | $y_n$ ← label


Supervised vs. Unsupervised 


| fruit | length | weight | label |
[Table rows not recoverable; the labels shown are Banana and Orange.]


Unsupervised learning: 
Learning a model from unlabeled data. 


Supervised learning: 
Learning a model from labeled data. 


Unsupervised Learning 


Training data: "examples" x.
$x_1, \ldots, x_n$, $x_i \in X \subset \mathbb{R}^d$
• Clustering/segmentation:


$f : \mathbb{R}^d \to \{C_1, \ldots, C_k\}$ (set of clusters).


Example: Find clusters in the population, fruits, species. 


Unsupervised learning


[Figure: unlabeled examples plotted as Feature 2 against Feature 1.]


Methods: K-means, Gaussian mixtures, hierarchical clustering, spectral clustering, etc.
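

To make the clustering map f concrete, here is a minimal sketch of K-means, the first method listed above. It is an illustration under stated assumptions rather than a reference implementation: the function name `kmeans`, the toy points, and the choice to start from the first k points (instead of a random draw) are all assumptions made for reproducibility.

```python
import math

def kmeans(points, k, iters=10):
    """Minimal K-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Simplification (assumption): start from the first k points instead of a
    # random draw, so that the run is reproducible.
    centers = list(points[:k])
    for _ in range(iters):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Final labels: index of the nearest center for every point.
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return centers, labels

# Made-up 2-D data: two visually separate groups (Feature 1, Feature 2).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = kmeans(points, k=2)
print(labels)
```

No labels are used anywhere: the grouping comes only from distances between examples, which is what makes this unsupervised.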


Supervised learning 


Training data: "examples" x with "labels" y.
$(x_1, y_1), \ldots, (x_n, y_n)$ / $x_i \in \mathbb{R}^d$
• Classification: y is discrete. To simplify, $y \in \{-1, +1\}$


$f : \mathbb{R}^d \to \{-1, +1\}$; f is called a binary classifier.


Example: Approve credit yes/no, spam/ham, banana/orange. 


Supervised learning


[Figure: labeled examples plotted as Feature 2 against Feature 1, separated by a decision boundary.]


Methods: Support Vector Machines, neural networks, decision trees, K-nearest neighbors, naive Bayes, etc.


Supervised learning


Classification:
[Figure: classification example.]


Supervised learning


Nonlinear classification
[Figure a and Figure b; one surviving axis label reads sqrt(2)x1x2.]


Supervised learning 


Training data: "examples" x with "labels" y.

$(x_1, y_1), \ldots, (x_n, y_n)$ / $x_i \in \mathbb{R}^d$

• Regression: y is a real value, $y \in \mathbb{R}$

$f : \mathbb{R}^d \to \mathbb{R}$; f is called a regressor.
Example: amount of credit, weight of fruit.
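

As an illustration of a regressor f, the sketch below fits a straight line by ordinary least squares to predict the weight of a fruit from its length. The data values and the helper name `fit_line` are made up for the example; least squares is only one possible way to learn f.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: fruit length (cm) and weight (g).
lengths = [10.0, 12.0, 14.0, 16.0]
weights = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(lengths, weights)

def f(x):
    """The learned regressor: maps a length to a predicted weight."""
    return slope * x + intercept

print(f(13.0))
```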


Supervised learning


Regression:
[Figure: regression example; the horizontal axis is Feature 1.]

Example: Income as a function of age, weight of the fruit as a function of its length.


Training and Testing


[Figure: an ML algorithm applied to inputs (income, gender, age, family status, zipcode) to predict credit amount ($) or credit yes/no.]


K-nearest neighbors 


Not every ML method builds a model! 


Our first ML method: KNN. 


• Main idea: Uses the similarity between examples.
• Assumption: Two similar examples should have the same label.
• Assumes all examples (instances) are points in the d-dimensional space $\mathbb{R}^d$.


K-nearest neighbors 


• KNN uses the standard Euclidean distance to define nearest neighbors.
Given two examples $x_i$ and $x_j$:

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$$


K-nearest neighbors


Training algorithm:
Add each training example (x, y) to the dataset D.
$x \in \mathbb{R}^d$, $y \in \{+1, -1\}$.


Classification algorithm:
Given an example $x_q$ to be classified. Suppose $N_k(x_q)$ is the set of the K-nearest neighbors of $x_q$.

$$\hat{y}_q = \operatorname{sign}\Big(\sum_{x_i \in N_k(x_q)} y_i\Big)$$
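

The training and classification steps above can be sketched in a few lines. This is a brute-force illustration, not an optimized implementation; the function name `knn_classify`, the toy dataset, and the rule that a tied vote goes to +1 are assumptions (the sign formula leaves the tie case open).

```python
import math

def knn_classify(train, x_q, k=3):
    """Predict the label of x_q from a list of (x, y) pairs with y in {+1, -1}."""
    # "Training" is just storing the examples; all the work happens at query time.
    # Sort the stored examples by Euclidean distance to x_q and keep the k nearest.
    neighbors = sorted(train, key=lambda example: math.dist(example[0], x_q))[:k]
    # The sign of the summed labels is a majority vote among the neighbors.
    vote = sum(y for _, y in neighbors)
    return 1 if vote >= 0 else -1  # assumption: ties (possible for even k) go to +1

# Hypothetical dataset D in R^2.
train = [((1.0, 1.0), +1), ((1.5, 2.0), +1), ((2.0, 1.0), +1),
         ((6.0, 6.0), -1), ((7.0, 5.5), -1), ((6.5, 7.0), -1)]

pred_a = knn_classify(train, (1.2, 1.4))
pred_b = knn_classify(train, (6.4, 6.1))
print(pred_a, pred_b)
```

Choosing an odd K avoids tied votes in the binary case.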


K-nearest neighbors


[Figure: 3-NN. Credit: Introduction to Statistical Learning.]


Question: Draw an approximate decision boundary for K = 3.


K-nearest neighbors


[Figure. Credit: Introduction to Statistical Learning.]


K-nearest neighbors


Question: What are the pros and cons of K-NN?
Pros:
+ Simple to implement.
+ Works well in practice.
+ Does not require building a model, making assumptions, or tuning parameters.
+ Can be extended easily with new examples.

Cons:
- Requires large space to store the entire training dataset.
- Slow! Given n examples and d features, the method takes O(n × d) time to classify one example.
- Suffers from the curse of dimensionality.


Applications of K-NN 


1. Information retrieval.
2. Handwritten character classification using nearest neighbor in large databases.
3. Recommender systems (users like you may like similar movies).
4. Breast cancer diagnosis.
5. Medical data mining (similar patient symptoms).
6. Pattern recognition in general.


Training and Testing


Question: How can we be confident about f?


Training and Testing


• We calculate $E^{train}$, the in-sample error (training error or empirical error/risk).

$$E^{train}(f) = \sum_{i=1}^{n} \text{loss}(y_i, f(x_i))$$

• Examples of loss functions:

- Classification error:

$$\text{loss}(y_i, f(x_i)) = \begin{cases} 1 & \text{if } \operatorname{sign}(y_i) \neq \operatorname{sign}(f(x_i)) \\ 0 & \text{otherwise} \end{cases}$$

- Least square loss:

$$\text{loss}(y_i, f(x_i)) = (y_i - f(x_i))^2$$

• We aim to have $E^{train}(f)$ small, i.e., minimize $E^{train}(f)$.

• We hope that $E^{test}(f)$, the out-sample error (test/true error), will be small too.
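

The training error is just a sum of per-example losses, which the following sketch computes for both loss functions. The model `f` (the identity function) and the five data points are hypothetical, chosen only so the numbers are easy to check by hand.

```python
def zero_one_loss(y, prediction):
    """Classification error: 1 if the signs of y and the prediction differ, else 0."""
    return 1 if y * prediction <= 0 else 0

def squared_loss(y, prediction):
    """Least square loss: squared difference between label and prediction."""
    return (y - prediction) ** 2

def training_error(f, data, loss):
    """Sum of loss(y_i, f(x_i)) over all training examples (x_i, y_i)."""
    return sum(loss(y, f(x)) for x, y in data)

# Hypothetical 1-D training set with labels in {-1, +1}.
data = [(-2.0, -1), (-1.0, -1), (0.5, 1), (2.0, 1), (-0.5, 1)]

def f(x):
    """Hypothetical model: predicts the input itself, so sign(x) is the class."""
    return x

err_01 = training_error(f, data, zero_one_loss)   # only (-0.5, +1) is misclassified
err_sq = training_error(f, data, squared_loss)
print(err_01, err_sq)
```

The same `training_error` function applied to held-out examples gives an estimate of the out-sample error.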


Overfitting/underfitting


An intuitive example


Structural Risk Minimization
[Figure: as the complexity of the model goes from low to high, models move from underfitting, through good models, to overfitting.]


Training and Testing


[Figure: three fits of Income against Age: high bias (underfitting), just right, and high variance (overfitting).]


Avoid overfitting 


In general, use simple models! 
• Reduce the number of features manually or do feature selection.
• Do a model selection (ML course).
• Use regularization (keep the features but reduce their importance by setting small parameter values) (ML course).
• Do a cross-validation to estimate the test error.


Regularization: Intuition 


We want to minimize: 


Classification term + C × Regularization term

$$\sum_{i=1}^{n} \text{loss}(y_i, f(x_i)) + C \times R(f)$$


Regularization: Intuition 


$f(x) = \lambda_0 + \lambda_1 x$ ... (1)
$f(x) = \lambda_0 + \lambda_1 x + \lambda_2 x^2$ ... (2)
$f(x) = \lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3 + \lambda_4 x^4$ ... (3)


Hint: Avoid high-degree polynomials. 
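

A small numeric sketch of why the regularization term favors the lower-degree polynomial. The slides do not define R(f); here it is assumed to be the sum of squared non-constant coefficients (a ridge-style penalty), and the loss is the least square loss. The data and both coefficient vectors are made up so that the two polynomials fit the training points equally well.

```python
def poly(coeffs, x):
    """Evaluate lambda_0 + lambda_1 * x + lambda_2 * x^2 + ... at x."""
    return sum(c * x ** power for power, c in enumerate(coeffs))

def objective(coeffs, data, C):
    """Sum of squared losses plus C times an assumed penalty R(f)."""
    fit = sum((y - poly(coeffs, x)) ** 2 for x, y in data)
    # Assumption: R(f) = sum of squared coefficients, excluding the constant term.
    penalty = sum(c ** 2 for c in coeffs[1:])
    return fit + C * penalty

# Three hypothetical training points lying on the line y = x.
data = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

degree_1 = [0.0, 1.0]              # f(x) = x
degree_3 = [0.0, 3.0, -3.0, 1.0]   # f(x) = x + x(x - 1)(x - 2), also passes through all points

unregularized = (objective(degree_1, data, C=0.0), objective(degree_3, data, C=0.0))
regularized = (objective(degree_1, data, C=0.1), objective(degree_3, data, C=0.1))
print(unregularized, regularized)
```

With C = 0 the two models are indistinguishable on the training data; with C > 0 the cubic pays for its larger coefficients, so minimizing the objective selects the simpler model.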


Train, Validation and Test


Example: Split the data randomly into 60% for training, 20% for validation and 20% for testing.

[Figure: the dataset divided into TRAIN, VALIDATION and TEST portions.]


1. Training set is a set of examples used for learning a model (e.g., a classification model).

2. Validation set is a set of examples that cannot be used for learning the model but can help tune model parameters (e.g., selecting K in K-NN). Validation helps control overfitting.

3. Test set is used to assess the performance of the final model and provide an estimation of the test error.


Note: Never use the test set in any way to further tune
the parameters or revise the model.
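

A minimal sketch of the random 60/20/20 split described above. The function name `train_val_test_split`, the fixed seed, and the stand-in dataset of 100 integers are assumptions for illustration.

```python
import random

def train_val_test_split(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into train/validation/test parts."""
    shuffled = examples[:]                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is reproducible
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains (about 20% here)
    return train, val, test

examples = list(range(100))  # stand-in for 100 labeled examples
train, val, test = train_val_test_split(examples)
print(len(train), len(val), len(test))
```

Model parameters such as K in K-NN would be chosen by comparing errors on `val`; `test` is touched once, at the very end.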


K-fold Cross Validation 


A method for estimating test error using training data. 


Algorithm: 


Given a learning algorithm A and a dataset D 


Step 1: Randomly partition D into k equal-size subsets $D_1, \ldots, D_k$

Step 2:
For j = 1 to k
    Train A on all $D_i$, $i \in \{1, \ldots, k\}$ and $i \neq j$, and get $f_j$.
    Apply $f_j$ to $D_j$ and compute $E^{D_j}$

Step 3: Average error over all folds.

$$\frac{1}{k} \sum_{j=1}^{k} E^{D_j}$$
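

The three steps can be sketched as a generic function that takes the learning algorithm A as `train_fn`. The learner used here (always predict the majority label) and the error measure (misclassification rate per fold) are hypothetical stand-ins; any learner and error function with the same shapes would do.

```python
import random

def k_fold_error(train_fn, error_fn, data, k=5, seed=0):
    """Estimate test error by k-fold cross validation."""
    # Step 1: randomly partition the data into k (nearly) equal-size folds.
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[j::k] for j in range(k)]
    # Step 2: train on all folds except fold j, then evaluate on fold j.
    errors = []
    for j in range(k):
        rest = [ex for i, fold in enumerate(folds) if i != j for ex in fold]
        f_j = train_fn(rest)
        errors.append(error_fn(f_j, folds[j]))
    # Step 3: average the error over all folds.
    return sum(errors) / k

def train_majority(examples):
    """Hypothetical learner A: always predict the majority label (+1 on ties)."""
    label = 1 if sum(y for _, y in examples) >= 0 else -1
    return lambda x: label

def error_rate(f, examples):
    """Fraction of examples that f misclassifies."""
    return sum(1 for x, y in examples if f(x) != y) / len(examples)

# Made-up dataset: eight examples labeled +1 and two labeled -1.
data = [((float(i),), 1) for i in range(8)] + [((8.0,), -1), ((9.0,), -1)]
cv_error = k_fold_error(train_majority, error_rate, data, k=5)
print(cv_error)
```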


Confusion matrix


                           | Actual Label: Positive | Actual Label: Negative
Predicted Label: Positive  | True Positive (TP)     | False Positive (FP)
Predicted Label: Negative  | False Negative (FN)    | True Negative (TN)


Evaluation metrics


Accuracy             | (TP + TN) / (TP + TN + FP + FN) | The percentage of predictions that are correct
Precision            | TP / (TP + FP)                  | The percentage of positive predictions that are correct
Sensitivity (Recall) | TP / (TP + FN)                  | The percentage of positive cases that were predicted as positive
Specificity          | TN / (TN + FP)                  | The percentage of negative cases that were predicted as negative
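

The four metrics follow directly from the confusion-matrix counts, as in this short sketch. The label vectors are made up, and the helper names `confusion_counts` and `evaluation_metrics` are assumptions; the code does not guard against empty denominators.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn

def evaluation_metrics(tp, fp, fn, tn):
    """Metrics as defined in the table; assumes no denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # sensitivity
        "specificity": tn / (tn + fp),
    }

# Hypothetical labels: 1 = positive, 0 = negative.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
scores = evaluation_metrics(*counts)
print(counts, scores)
```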


Terminology review 


Review the concepts and terminology: 


Instance, example, feature, label, supervised learning, unsupervised learning, classification, regression, clustering, prediction, training set, validation set, test set, K-fold cross validation, classification error, loss function, overfitting, underfitting, regularization.


Machine Learning Books 


1. Tom Mitchell. Machine Learning.

2. Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. Learning From Data. AMLBook.

3. T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

4. Christopher Bishop. Pattern Recognition and Machine Learning.

5. Richard O. Duda, Peter E. Hart, David G. Stork. Pattern Classification. Wiley.


Machine Learning Resources 


• Major journals/conferences: ICML, NIPS, UAI, ECML/PKDD, JMLR, MLJ, etc.

• Machine learning video lectures:
http://videolectures.net/Top/Computer_Science/Machine_Learning/

• Machine Learning (Theory):
http://hunch.net/

• LinkedIn ML groups: "Big Data" Scientist, etc.

• Women in Machine Learning:
https://groups.google.com/forum/#!forum/women-in-machine-learning

• KDnuggets: http://www.kdnuggets.com/


Credit 


• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition, 2009. T. Hastie, R. Tibshirani, J. Friedman.

• Machine Learning 1997. Tom Mitchell.


